GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning

Neural Information Processing Systems

Large Language Models (LLMs) are increasingly used for various tasks with graph structures. Though LLMs can process graph information in a textual format, they overlook the rich vision modality, which is an intuitive way for humans to comprehend structural information and conduct general graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., visual graphs) are still unexplored. To fill this gap, we propose an end-to-end framework called Graph to vIsual and Textual IntegrAtion (GITA), which is the first to incorporate visual graphs into general graph reasoning. Extensive experiments on the GVLQA dataset and five real-world datasets show that GITA outperforms mainstream LLMs in general graph reasoning capabilities.
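As a rough illustration of the "textual format" for graphs that the abstracts above contrast with visual graphs, the sketch below serializes an edge list into a plain-text description an LLM could consume. The function name and the exact phrasing are illustrative assumptions, not the format used by GITA or GITQA.

```python
def describe_graph(edges):
    """Serialize an edge list into a plain-text graph description
    suitable for an LLM prompt (illustrative format, not GITA's)."""
    nodes = sorted({n for edge in edges for n in edge})
    lines = [f"The graph has {len(nodes)} nodes: "
             f"{', '.join(str(n) for n in nodes)}."]
    for u, v in edges:
        lines.append(f"Node {u} is connected to node {v}.")
    return "\n".join(lines)

# Example: a small path graph 0-1-2
print(describe_graph([(0, 1), (1, 2)]))
```

A vision-language pipeline like the ones described above would pair such a description with a rendered image of the same graph, then feed both modalities to the model.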


Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning

Zhu, Yingjie, Bai, Xuefeng, Chen, Kehai, Xiang, Yang, Zhang, Min

arXiv.org Artificial Intelligence

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through 3 self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs' zero-shot performance on fundamental graph learning tasks, as well as enhancing the robustness of LVLMs against complex visual graphs.


Rendering Graphs for Graph Reasoning in Multimodal Large Language Models

Wei, Yanbin, Fu, Shuai, Jiang, Weisen, Kwok, James T., Zhang, Yu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used for various tasks with graph structures, such as robotic planning, knowledge graph completion, and common-sense reasoning. Though LLMs can comprehend graph information in a textual format, they overlook the rich visual modality, which is an intuitive way for humans to comprehend structural information and conduct graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., visual graphs) are still unexplored. In this paper, we take the first step in incorporating visual information into graph reasoning tasks and propose a new benchmark, GITQA, where each sample is a tuple (graph, image, textual description). We conduct extensive experiments on the GITQA benchmark using state-of-the-art multimodal LLMs. Results on graph reasoning tasks show that combining textual and visual information performs better than using either modality alone. Moreover, the LLaVA-7B/13B models finetuned on the training set achieve higher accuracy than the closed-source model GPT-4(V). We also study the effects of augmentations in graph reasoning.


SCOPE: Structural Continuity Preservation for Medical Image Segmentation

Yeganeh, Yousef, Farshad, Azade, Guevercin, Goktug, Abu-zer, Amr, Xiao, Rui, Tang, Yongjian, Adeli, Ehsan, Navab, Nassir

arXiv.org Artificial Intelligence

Although the preservation of shape continuity and physiological anatomy is a natural assumption in the segmentation of medical images, it is often neglected by deep learning methods, which mostly model input data statistically as pixels rather than as interconnected structures. In biological structures, however, organs are not separate entities; a severed vessel, for example, indicates an underlying problem. Traditional segmentation models are not designed to strictly enforce the continuity of anatomy, potentially leading to inaccurate medical diagnoses. To address this issue, we propose a graph-based approach that enforces the continuity and connectivity of anatomical topology in medical images. Our method encodes the continuity of shapes as a graph constraint, ensuring that the network's predictions maintain this continuity. We evaluate our method on two public retinal vessel segmentation benchmarks, showing significant improvements in connectivity metrics over traditional methods while achieving better or on-par performance on segmentation metrics.
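To make the notion of a "connectivity metric" concrete, the sketch below counts 4-connected foreground components in a binary mask: a severed vessel shows up as extra components. This is a minimal stand-in for illustration only, not the specific metric or graph constraint used in the SCOPE paper.

```python
from collections import deque

def count_components(mask):
    """Count 4-connected foreground components in a binary mask
    (list of rows of 0/1). A broken vessel yields extra components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                components += 1
                # Breadth-first flood fill over this component
                queue = deque([(i, j)])
                seen[i][j] = True
                while queue:
                    x, y = queue.popleft()
                    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < h and 0 <= ny < w
                                and mask[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            queue.append((nx, ny))
    return components

# A "severed vessel": one pixel line broken in the middle -> 2 components
print(count_components([[1, 1, 0, 1, 1]]))  # prints 2
```

A topology-aware model would be penalized when its prediction has more components than the ground-truth anatomy, which is the intuition behind enforcing continuity as a constraint.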


8 Tools Every Data Scientist Should Use

#artificialintelligence

I explain Artificial Intelligence terms and news to non-experts. Two years ago, I saw my first research paper ever. I remember how old it looked and how discouraging the mathematics inside was. It really did look like something researchers work on in movies. To be fair, the paper was from the 1950s, but the format hasn't changed much since then.